Modeling is a key step in the pipeline of statistical inference. In this series of lectures we will learn how to fit parametric and non-parametric models to our data.
In the parametric case we will focus on the classical frequentist method to fit models: maximum likelihood estimation (MLE)
Then we will review non-parametric modeling techniques such as the Histogram and Kernel density estimation
After that we will go back to parametric modeling and extend MLE using priors: Maximum a posteriori
Finally we will present key ideas on bayesian modeling that are to be further developed in a future series of lectures
2.1.1 What is statistical inference?
Inference is:
To draw conclusions from facts through a scientific premise
In the particular case of Statistical inference we have
Facts: Observed data
Premise: Probabilistic model
Conclusion: An unobserved quantity of interest
Objective: Quantify the uncertainty of the conclusion given the data and the premise
The following are examples examples of statistical inference tasks:
Parameter estimation: What is the best estimate of a model parameter based on the observed data?
Confidence estimation: How trustworthy is our point estimate?
Hypothesis testing: Is the data consistent with a given hypothesis or model?
2.1.2 The difference between parametric and non-parametric models
To conduct inference we start by defining a statistical model
Models can be broadly classified as:
Parametric:
It corresponds to an analytical function (distribution) with free parameters
Has an a-priori fixed number of parameters
In general: Stronger assumptions, easier to interpret, faster to use
Non-parametric:
Distribution-free model but they do have parameters and assumptions (e.g. dependence)
The number of parameters depends on the amount of training data
In general: More flexible, harder to train
2.1.3 The difference between Frequentist and Bayesian inference
There are two main paradigms or perspectives for statistical inference: Frequentist (F) or classical and Bayesian (B). There are conceptual differences between these paradigms, for example
Definition of probability:
F: Relative frequency of an event. An objective property of the real world
B: Degree of subjective belief. Probability statements can be made not only on data but also on parameters and models themselves
Interpretation of parameters:
F: They are unknown and fixed constants
B: They have distributions that quantify the uncertainty of our knowledge about them. We can compute expected values of the parameters
In this lecture we will focus on the frequentist approach to modeling. We will review the bayesian perspective in a future lesson
2.2 Frequentist approach on parametric modeling
In the case of parametric inference we assume that observations follow a certain distribution, i.e. observations are a realization of a random process (sampling)
The conceptual (iterative) steps of parametric inference are:
Model fitting: Find parameters by fitting data to the current model
Model proposition: Propose a new model that accommodates important features of the data better than the previous one
In the frequentist approach step 1 is typically solved using Maximum Likelihood Estimation (MLE). Other frequentist alternatives are the Method of Moments (MoM) and the M-estimator. Only MLE is covered in this lecture
2.2.1 The likelihood
The likelihood function is a quantitative description of our experiment (measuring process), and it is the starting point for parametric modeling for both the frequentist and bayesian paradigms. In simple terms the likelihood tells us how good our model is with respect to the observed data
Let’s now give a mathematical description of the likelihood
Suppose we have an experiment that we model as a set of R.Vs \(X_1, X_2, \ldots, X_N\)
We also have observations/realizations from our R.Vs \(\{x_i\} = x_1, x_2, \ldots, x_N\)
We assume that the R.Vs follow a certain joint probability distribution \(f(x_1, x_2, \ldots, x_N | \theta)\) with parameters \(\theta\)
For the examples in this lecture we will additionally assumme that our observations are independent and identically distributed (iid). With this we can simplify the previous expression as
The value of the likelihood itself does not hold much meaning, but it can be used to make comparisons between different values of the parameter vector \(\theta\). The larger the likelihood the better the model, i.e. likelihood maximization allows us to find the best \(\theta\) for our data
Before continuing consider the following
Likelihood is not probability
The likelihood of a set of RVs does not integrate (or sum in the discrete case) to unity, i.e. in general the likelihood is not a valid probability density function.
The likelihood by itself cannot be interpreted as a probability of \(\theta\), it only tells us how likely is that \(\{x_i\}\) was generated by the distribution \(f\) with parameter \(\theta\)
2.3 Maximum Likelihood Estimation (MLE)
In parametric modeling we are interested in finding \(\theta\) that best fit our observations.
One method to do this is MLE:
1 Select a distribution (model) for the observations and formulate the likelihood \(\mathcal{L}(\theta)\)
2 Search for \(\theta\) that maximizes \(\mathcal{L}(\theta)\) given the data, i.e.
where the point estimate \(\hat \theta\) is called the maximum likelihood estimator of \(\theta\)
After this we can
3 Determine the confidence region of \(\hat \theta\) either analytically or numerically (bootstrap, cross-validation, etc)
4 Make conclusions about your model (hypothesis test)
A wrong assumption in step 1 can ruin your inference. Evaluate how appropriate your model is, compare with other models and suggest incremental improvements
2.3.1 Example: MLE for the mean of a Gaussian distribution
Let us consider a set of N measurements \(\{x_i\}_{i=1,\ldots, N}\) obtained from a sensor and that all were obtained under the same conditions
The sensor used to measure the data has an error that follows a Gaussian distribution with known variance \(\sigma^2\)
If the conditions are the same then the measurements can be viewed as noisy realizations of the true value \(\mu\)
We can find \(\mu\) by maximizing the likelihood given \(\{x_i\}\)
In many cases it is more practical to find the maximum of the logarithm of the likelihood (e.g. distributions from the exponential family). Logarithm is a monotonic function and its maximum is the same as its argument.
If the true mean is not known then this is a biased estimator of the true variance. MLE does not guarantee unbiased estimators
The following code example shows how the value of the MLEs of these parameters evolve as more data is observed
Code
import holoviews as hvhv.extension('bokeh')import numpy as npimport scipy.signal# data from the sensornp.random.seed(1234)x =80+ np.random.randn(10000)#x = 80 + 2*np.random.rand(1000) # What happens if the data is not normal# Computing the MLENs = np.round(np.logspace(0, 4, num=16)).astype('int')hat_mu = np.array([np.mean(x[:N]) for N in Ns])hat_s2 = np.array([np.mean((x[:N]-mu)**2) for mu, N inzip(hat_mu, Ns)])mu_plot = hv.Curve((Ns, hat_mu), 'Number of samples', 'hat mu')*hv.HLine(80).opts(color='k')s2_plot = hv.Curve((Ns, hat_s2), 'Number of samples', 'hat s2')*hv.HLine(1).opts(color='k')(mu_plot + s2_plot).opts(hv.opts.Curve(logx=True, width=350), hv.opts.HLine(line_dash='dashed'))
/home/phuijse/.conda/envs/quarto/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.1
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Unable to display output for mime type(s):
2.3.3 A note on biased and unbiased estimators
For a parameter \(\theta\) and an estimator \(\hat \theta\), if
\[
\mathbb{E}[\hat \theta] = \theta,
\]
then \(\hat \theta\) is an unbiased estimator of \(\theta\)
In this case we cannot obtain analytical expressions for the parameters by setting the derivative to zero. We have to resort to iterative methods/optimizers, e.g. gradient descent, expectation maximization #
We will come back to this in a future class on expectation maximization and Gaussian mixture models
2.4 Optimality properties and uncertainty of MLEs
Assuming that the data truly comes from the specified model the MLE is
Consistent: The estimate converge to the true parameter as data points increase
\[
\lim_{N\to \infty} \hat \theta = \theta
\]
Asymptotically normal: The distribution of the estimate approaches a normal centered at the true parameter.
Because MLE have asymptotically normal distributions the log likelihood ratio have asymptotically a chi-square distributions (more about this later)
Minimum variance: The estimate achieve the theoretical minimal variance given by the Cramer-Rao bound. This bound is the inverse of the expected Fisher information, i.e the second derivative of \(- \log L\) with respect to \(\theta\)
If \(-2 \log \lambda(\mathcal{D})\) is greather than the \((1-\alpha)100\%\) quantile of \(\chi^2_1\) we reject the null
2.6 Criteria for model comparison
How to compare models with different number of parameters? In general the more number of parameters the better the fit (overfitting). The likelihood does not take into account the complexity (number of parameters) of the model
How to score models taking into account their complexity?
One option is to use the Akaike information criterion (AIC). For a model with \(k\) parameters and N data points the AIC is
which one seeks to minimize. The AIC combines the likelihood (score) of the model and its complexity. The AIC is based on an asumptotic approximation which we will review in the future
This is also related to the idea of regularization, which will also be reviewed in future lectures